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Abstract 

Bayesian inferences in high energy physics often use uniform prior distributions for parameters 
about which little or no information is available before data are collected. The resulting poste- 
rior distributions are therefore sensitive to the choice of parametrization for the problem and may 
even be improper if this choice is not carefully considered. Here we describe an extensively tested 
methodology, known as reference analysis, which allows one to construct parametrization-invariant 
priors that embody the notion of minimal informativeness in a mathematically well-defined sense. 
We apply this methodology to general cross section measurements and show that it yields sen- 
sible results. A recent measurement of the single top quark cross section illustrates the relevant 
techniques in a realistic situation. 
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I. INTRODUCTION 

The Bayesian approach [If to inference plays an increasingly important role in particle 
physics research. This is due, in part, to a better understanding of Bayesian reasoning 
within the field and the concomitant abating of the frequentist /Bayesian debate. Moreover, 
the small but growing number of successful applications provide concrete examples of how 
the Bayesian approach fares in practice. 

In spite of these successes the specification of priors in a principled way remains a concep- 
tual and practical hurdle. In the so-called subjective Bayesian approach [2|, one is invited to 
elicit the prior based on one's actual beliefs about the unknown parameters in the problem. 
If one has well-understood information, for example based on subsidiary measurements or 
simulation studies, one can encode this partial information in an evidence-based prior 
Such priors generally occasion little or no controversy. On the other hand, if one knows little 
about a given parameter, or if one prefers to act as if one knows little, then it is far from 
clear how one ought to encode this minimal information in a prior probability. 

Since there is, in fact, no unique way to model prior ignorance, a viewpoint has evolved 
in which this lack of knowledge is represented by one's willingness to adopt a standard 
prior for certain parameters 4|, just as one has adopted a standard for quantities such as 
length and weight. In this spirit, our field adopted as a convention a uniform (flat) prior for 
unknown cross sections and other parameters (see for example Ref. [ij]), mainly because this 
prescription is simple to implement and seems to embody Laplace's principle of insufficient 
reason. Unfortunately, uniform priors are both conceptually and practically flawed. The 
conceptual difficulty is with their justification: lack of knowledge about a parameter a implies 
lack of knowledge about any one-to-one transform o' of er, and yet a prior distribution that 
is uniform in a will not be so in a' if the transform is non-linear. The practical problem is 
that careless use of uniform priors can lead to improper posteriors, that is, posteriors whose 
integrals are infinite and which can therefore not be used to assign meaningful probabilities 
to subsets of parameter space. An example of this pathology is found in a common method 
for reporting the exclusion of a new physics signal, where one estimates an upper limit from 
a posterior distribution for the signal's production cross section. When constructed from 
a Poisson probability mass function for the observations, a flat prior for the signal cross 
section, and a truncated Gaussian prior for the signal acceptance, this posterior is actually 
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improper. However, for small acceptance uncertainties the divergence of the upper limit is 
often concealed by the inevitable truncation of numerical computations jsj. 

The specification of priors that encode minimal information is of such importance in 
practice that a large body of literature exists describing attempts to construct priors that 
yield results with provably useful characteristics. These priors are typically arrived at using 
formal rules. In this paper, therefore, we refer to them as formal priors |7[ to distinguish 
them from evidence-based priors. Many such formal rules exist [4]. In this paper we study, 
and then recommend, a rule which is arguably the most successful: that developed by 
Bernardo js] and Berger and Bernardo js-ll]. Formal priors constructed according to the 
Bernardo-Berger rule are called reference priors, a somewhat unfortunate name given that 
the term reference prior is sometimes used as a synonym for what we have called a formal 
prior. 

Reference priors have been shown to yield results with several desirable properties, all 
of which should appeal to particle physicists. Therefore, in principle such priors could be 
a foundation for Bayesian inference in particle physics research. However, reference priors 
and the associated methods collectively referred to as reference analysis 12|, |13J] have yet 
to enter the field in a significant way. The purpose of this paper is to initiate this process 
by applying the Bernardo-Berger method to a familiar, but important class of problems, 
namely that of calculating posterior densities for signal cross sections. 

In the next section we describe the general goals of reference prior construction and show 
how these are implemented via the concept of missing information. For simplicity we limit 
that discussion to one-parameter problems. Section HTT1 then considers the treatment of nui- 
sance parameters about which prior information is available. Examples of such parameters 
include detector calibration constants, background contaminations, geometrical acceptances, 
and integrated luminosities. We describe two methods for handling these parameters, de- 
pending on the type of information that is available about them. These methods are then 
applied to counting experiments with uncertain background contamination and effective lu- 
minosity. In the simplest cases we have obtained analytical expressions for the marginal 
posterior for the quantity of interest. For the general case we have developed a numerical 
algorithm. Some appealing properties of these posteriors are examined in Sec. IIVI In Sec. |V] 
the reference prior methodology is applied to a recent measurement of the production cross 
section for single top quarks at the Tevatron. Final comments are presented in Sec. IVII 
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II. REFERENCE PRIORS 



In 1979, Bernardo 



is 



introduced a formal rule for constructing what he called a reference 
prior. The goal was to construct a prior which, in a sense to be made precise, contained 
as little information as possible relative to the statistical model under consideration. By 
statistical model he meant a representation of the entire experimental design, including the 
probability distribution of the data, the sampling space, and the stopping rule. Hence, by 
construction reference priors depend on all these aspects of a statistical model, and so will 
inferences derived from data with the help of a reference prior. This may seem to violate 
the so-called likelihood principle [l4], according to which all the information about unknown 
model parameters obtainable from an experiment is contained in the likelihood function, 
i.e. the probability distribution of the data, evaluated at the observations and viewed as 
a function of the parameters. While this is formally true, it should be kept in mind that 
the likelihood principle applies after data have been observed, whereas reference priors are 
constructed at the experimental design stage. Their purpose is to approximate a consensus 
of opinions that is suitable for scientific communication. This is generally unproblematic in 
large-sample situations, where posterior inferences are dominated by the likelihood function. 
In small sample cases however, results obtained with reference priors should be considered 
preliminary, and a careful study should be conducted of the degree to which inferences 
about the physics model underlying the observations can be trusted. This can be achieved 
by examining the sensitivity of the results to changes in the prior, and subsequently assessing 
the need for additional observations. 

Reference priors have several desirable properties, including 

1. generality: a well-defined algorithm exists to create a reference prior for almost any 
type of estimation problem, and the resulting posterior is proper; 

2. invariance: given a one-to-one map from a parameter 9 to a parameter <fr, applying 
the reference prior construction separately to 9 and <fi yields posteriors that are related 
by the correct transformation law, n{cf)\x) = it (9 | x) \d9/d<p\] 

3. sampling consistency: the posterior densities from an ensemble of experiments tend to 
cluster around the true values of the parameters; and 

4. coherence: inferences derived from reference priors avoid marginalization paradoxes. 



Path 1: P< " X ' °' ^' \ +- n(9, <f> \ x) J n{9, cf>\x)d<f> = tt(9 \ x) = n x {9 \ t) 



Path2: *\eA)=v{t\0 ^ 



— ? 



FIG. 1: Let x be a dataset modeled by the probability density p(x\6,4>), where 9 and <f> are 
unknown parameters, and consider the following two paths to a posterior density for 9. In path 1, 
we use a formal prior 7r F1 (9,4>) to construct the joint posterior for 9 and <f>, and then integrate out 
4>. Suppose that the result of this operation only depends on the data x through the statistic t; 
this gives us ir^O 1 1). For path 2, assume further that the sampling distribution of t only depends 
on 6. We can then directly construct a posterior for 9, say ir 2 (9\t). A marginalization paradox 
occurs if 71"! (0 \t) ^ tt 2 (9 \ t) regardless of the choice of prior tt 2 (9) in path 2. 

Marginalization paradoxes [15[ arise in multiparameter problems when a posterior density 
can be calculated in different ways that ought to give the same answer but do not (see Fig. [1]). 
This incoherence does not happen with subjective or evidence-based priors because these 
priors are always proper. With formal priors however, it can only be avoided by allowing the 
joint prior for all the parameters in a given statistical model to depend on the quantity of 
interest. This is in fact what the reference prior construction does. For a simple illustration, 
consider n measurements Xi from the normal model with unknown mean \i and standard 
deviation a. The likelihood function is: 

_1 ( Xj-fj, \ 2 _ n-l I s\ 2 _n ( x-fi \ 2 

TT e2\cr/ e 2 \<j) 2 V a ) 

p{x\fi,a) = [[ = = , (1) 

where x = Yll x il n an d s 2 — Ylli. x i ~ x) 2 /{n — 1). When \i is the quantity of interest, the 
reference prior derived from this likelihood is \ jo. Restricting the remaining calculations to 
the case n = 2 for convenience, the joint reference posterior for /i and a is then: 

7r(n,a\x) = ——e~2\a) IJ. (2) 
7T a 6 

Integrating out a yields the marginal //-posterior, which is a Cauchy distribution with lo- 
cation parameter x and scale parameter s/y/2. Suppose however that our interest lies in 



the standardized mean 9 = fi/a. In a non-reference approach one would perform the trans- 
formation (/i, a) — > (9, a) in Eq. (j2]) and integrate out a in order to obtain the marginal 
^-posterior. The latter only depends on the data through the statistic t = y/2x/s: 



e W 2 

it 9 x) = 



1 + erf 



P(0\t), (3) 



'tt(1 +t 2 ) 

where erf is the error function. Furthermore, the sampling distribution of t turns out to 
depend on 9 only and is a noncentral Student's t distribution for one degree of freedom and 
with noncentrality parameter 9: 



1 + erf 



P^l 9 ) = ^T~^ + 



(4) 



7T(l+t 2 ) ^(l+t 2 ) 3 / 2 

It is clear that there exists no prior (no function of 9 only) that, multiplied by the likeli- 
hood @, leads to the posterior fl3]). Hence the marginalization paradox: someone who is 
only given the data value of the statistic t will be able to make inferences about the pa- 
rameter 9, but these inferences are guaranteed to disagree with those previously made by 
the Bayesian who had access to the full dataset. Resolution of this paradox hinges on the 
realization that lack of information about fi is not the same as lack of information about 
9. Therefore, the choice of which quantity is of interest must be done before calculating the 
prior. Since reference priors are derived from the likelihood function, the latter must first 
be expressed in terms of the relevant parameters, 9 and a: 

n—l ( s\ 2 nix „\ 2 

p(x\9,a) = . (5) 

(V CTj 

Applying the reference algorithm to this likelihood while treating 9 as the quantity of in- 



terest yields the prior ir(9,a) = l/(o- ^l + 2 /2), which is very different from the prior 1 jo 



obtained by treating fi as the quantity of interest. The resulting reference posterior suffers no 
marginalization problems. Further details about this example can be found in Refs. 

Our discussion of marginalization also helps to clarify the behavior of reference posteriors 
under transformations in the multiparameter setting. Reference posteriors are invariant 
under one-to-one transformations of the parameter of interest, but not under transformations 
that redefine the parameter of interest by mixing in one or more nuisance parameters. 
However, redefining the nuisance parameters is permitted. Suppose for example that ip is the 
parameter of interest, v the nuisance parameter(s), and consider an invertible transformation 



of the form ((p,v) —> (y?, A), where A is a function of both ip and v. Then the reference 
posterior for (p is unchanged by the transformation. 

Reference priors on unbounded parameter spaces are usually improper, which invalidates 
the application of Bayes' theorem. To circumvent this problem one introduces a nested 
sequence of compact subsets 0i C 2 C . . . of the parameter space 6, such that — > G 
as i —7- oo. Given an improper prior tt(6), its restriction to G^ will be proper, so that 
Bayes' theorem can be applied to construct the corresponding restricted posterior 7r^(9\x). 
The unrestricted posterior for the entire parameter space is then defined by the limit of the 
tti(8\x) as I —>■ oo. The practical justification for this procedure is that one often knows 
the shape, but not the size, of the physical region of parameter space where the prior has 
nonzero weight. As this size is typically very large, the limiting posterior can be viewed as 
an approximation to the posterior on the physical region. 

Interestingly, the limiting posterior can also be obtained by direct, formal application of 
Bayes' theorem to the improper prior ir(9), provided the marginal distribution of the data, 

Mr) , jMim*, (•) 

is finite. It can then be shown that the restricted posteriors 7ie(9\x) converge logarithmically 
to their limit it(9\x): 

lim D[ir t (9\x), tt(0|x)] = 0, (7) 

t— >oo 

where 

D\p(9),q(e)) = J q (0)\og^jrde (8) 

is the Kullback-Leibler divergence between p(8) and q{9). This divergence is a 
parametrization- independent, non- negative measure of the separation between two densi- 
ties; it is zero if and only if the densities are identical. Unfortunately, pointwise logarithmic 
convergence is not enough to avoid inferential inconsistency in some special cases [if]], so 
that a stronger form of convergence is needed, expected logarithmic convergence: 

\imE m \D[n e (e\x),n(9\x)]} = 0, (9) 

where the expectation is taken with respect to the marginal density m(x). 

The above discussion motivates the following terminology [16( . Given a statistical model 
on a parameter space G, a standard prior is a strictly positive and continuous function on G 
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that yields a proper posterior. A permissible prior is a standard prior for which the posterior 
is the expected logarithmic limit of a sequence of posteriors defined by restriction to compact 
sets. 



A. The Concept of Missing Information 

Reference priors make use of the notion of expected intrinsic information. For one obser- 
vation from a model p(x\9), the expected intrinsic information about the value of 9 when 
the prior is ir(9) is given by the functional 

/{vr} = E m {D[n(9\x), tt(0)]}. (10) 

The more informative the observation, the greater the expected separation between the 
posterior and the prior. The larger this separation, the greater the expected intrinsic infor- 
mation I{n}. Thus, I {71} measures the amount of information about the value of 9 that 
might be expected from one observation when the prior is n(9). 

Suppose next that we make k independent observations x^) = {xi,x 2 , • • • ,Xk} from the 
model p(x\9). The definition of expected intrinsic information can be generalized to include 
all k observations: 

I k {n} = E m [D[7r(9\x {k) ), tt(0)]}, (11) 
where the expectation is a /c-dimensional integral over x^) weighted by: 

k 



m(x {k) ) = J p(x {k) \9)n(9)d9 



7i(9)d9. (12) 



As the sample size k grows larger, one expects the amount of information about 9 to increase, 
and in the limit k — >■ 00, the true value of 9 would become exactly known. In this sense, the 
limit Ioo{vr} = lim^oo I k {n} represents the missing information about 9 when ix{9) is the 
prior. This concept of missing information is central to the construction of reference priors. 

B. Reference Priors for One-Parameter Models 

The goal of reference analysis is to contruct a prior that maximizes the missing informa- 
tion. This maximization cannot be done directly however, because 1^ typically diverges. 
To avoid this problem, one first constructs the prior n k (9) that maximizes I k , and then 
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takes the limit of 7Tk(9) as k — > oo. Additional care is required when the parameter space 
is unbounded, since in that case the prior that maximizes Ik is often improper, and Ik is 
undefined for improper priors. The solution is to define reference priors via their restrictions 
on arbitrary compact subsets 0^ of 0. Thus one is led to the formal definition of a reference 
prior for 9 as any permissible prior txr{9) that satisfies the so-called maximizing missing 
information (MMI) property, namely that 



lim 

fc— ¥00 



> 



(13) 



for any compact set 0^ and candidate prior tt(6), where ttr^ and Hi are the renormalized 
restrictions of ttr and 7r to 0£. A candidate prior is a standard prior that incorporates any 
prior knowledge about 9. 

A key result is the following constructive definition of the reference prior t[r{9) [if]], 

rr k (9) 



with 7Tfc(0) 



, lim fa \ ' 
exp <^ / p(x( k ) | 9) In 



p( x (k) 1 


9) h{9) 


/ p( x (k) 1 1 


T) h{9) d9 



dx 



(*) 



(14) 



where is an arbitrary fixed point in 0, h{9) is any continuous, strictly positive function, 



such as h{9) = 1, and p(x 



(fe) 



Yli=i P( x i I ^) i s ^ ne probability model for a sample of k 



independent observations. We emphasize that this constructive definition only guarantees 
that the MMI property (fl~3|) is satisfied. The permissibility part of the reference prior 



definition must be separately verified. However, the proponents of re 



erence priors view the 
rat 



MMI property as considerably more important than permissibility 16| . and also believe t 
it would be highly unusual for a prior satisfying the MMI property to fail permissibility [U | 
(counter-examples are known, but they are rather exotic). 

A further useful result that we shall exploit is that, when certain regularity conditions 
are met — essentially those that guarantee asymptotic normality of the posterior — the 
reference prior for models with one continuous parameter reduces to the well-known Jeffreys 



prior 



k r (9) 



d? 



d9 2 



lnp(x 



(15) 



where the expectation is taken with respect to the sampling model p(x \9). In general the 
analytical derivation of reference priors can be extremely challenging. However, Eq. (fl~4"]) is 
amenable to numerical integration 



We emphasize that the definition and results described in this section apply only to the 
case where the model of interest depends on a single parameter. A generalization to the 
multi-parameter case has been formulated and shown to have the properties listed at the 
beginning of Sec. [TT] 12] . We shall not describe it here however, except for when evidence- 
based priors are specified for the additional parameters. This is a very common situation in 
high energy physics, and will be discussed next. 



III. NUISANCE PARAMETERS 

The reference prior algorithm described in Sec. IHBI pertains to models containing no 
nuisance parameters. In practice, however, every non-trivial problem must contend with 
such parameters and the reference prior algorithm must be generalized accordingly. In 
this paper we restrict our attention to nuisance parameters for which partial information is 
available, which is often the case in practice. 

Depending on the type of partial information that is available, there are two plausible 
ways one might choose to incorporate nuisance parameters into the calculation of the 
reference priors for a parameter of interest 9 \\\ : 

Method 1: Assume that we are given a marginal prior 7r(</>) for the nuisance parameters; 
compute the conditional reference prior tir{9 \ 4>) for the interest parameter given a 
fixed value of 0; the full prior is then n(9, 0) = ir R (9 \ 0) 7r(0); 

Method 2: Assume that we are given a conditional prior ?r(0 | 9) for the nuisance parameter 
given the interest parameter; marginalize the probability model p(x\9, 0) with respect 
to in order to obtain p(x\9) = J p{x\9, 0) 7r(0|#) d(j), and compute the reference prior 
ttr(9) for the marginalized model; the full prior is then n(9, 0) = 7r(<f) | 9) 7Tr(9). 

In many high energy physics measurements there are often sound reasons for assuming that 
the nuisance parameter is independent of the parameter of interest. Information about a 
detector energy scale, for example, is typically determined separately from the measurement 
of interest, say of a particle mass, and is therefore considered to be independent a priori 
from one's information about the particle's mass. When an experimenter is willing to make 
this assumption, he or she can declare that 7r(0 | 9) = 7r(0) and use Method 2. When this 
assumption does not seem fully justified, and it is too difficult to elicit the 9 dependence 
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of 7r(0 I 9), then it will seem preferable to use Method 1, which only requires knowledge of 
the marginal prior 7r(0). When one is unsure of which method to use, one should use both, 
and treat the results as part of a test of robustness. An important practical advantage of 
Method 1 is that the conditional reference prior is computed once and for all, for a given 
model, and can be used with any evidence-based prior for the nuisance parameters. In 
contrast, for Method 2 the reference prior must be computed anew every time the priors for 
the nuisance parameters change. On the other hand, since Method 2 reduces the problem to 
one involving a single parameter, the reference prior algorithm reduces to Jeffreys' rule (TT5]) . 
which is typically easier to implement. 

In the next section we introduce the basic model studied in this paper, and follow with 
the application of Methods 1 and 2 to that model. 

A. The Single-Count Model 

A very common model for high energy physics measurements is the following. A number 
of events N is observed by some apparatus, and it is assumed that N is Poisson distributed 
with mean count ea + /i, where a is the rate of a physics signal process, typically the cross 
section, which we detect with an effective integrated luminosity e — that is, the integrated 
luminosity scaled by the signal efficiency, and fi is a background contamination. Thus, a is 
the parameter of interest, whereas e and /i are nuisance parameters for which we usually have 
partial information. For physical reasons none of these three parameters can be negative. 
We write the likelihood for this model as 

p(n\a,e,Li) = e ea M with < o < oo and < e, fi< oo. (16) 

Information about e and fi usually comes from a variety of sources, such as auxiliary mea- 
surements, Monte Carlo simulations, theoretical calculations, and evidence-based beliefs (for 
example, some sources of background contributing to fi may be deemed small enough to ig- 
nore, and some physics effects on e, such as gluon radiation, may be believed to be well 
enough reproduced by the simulation to be reliable "within a factor of 2"). It is therefore 
natural to represent that information by an evidence-based prior. Here we will assume that 
e and /i are independent of a and that their prior factorizes as a product of two gamma 
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densities: 

, N . , aiaeY- 1 ' 2 e~ ae b(bu)v-V 2 e- h " . , 

*(«, „ | a) = A = ^ - m V(, + l/2) ' (17) 

where a, 6, x, and y are known constants, related to the means e, p, and coefficients of 

variation 5e, 5fi by: 

- 5e= ; , P=^ L -r A , 5n= (18) 

x+i Jy + \ 



2 V 2 

The built-in assumption that e and /i are uncorrelated is clearly an approximation, since 
they share a dependence on the integrated luminosity, which is itself uncertain. 

There are two ways of interpreting this prior. The first one is appropriate when informa- 
tion about e and /J comes from one or more non-experimental sources, such as Monte Carlo 
studies and theoretical calculations, and takes the form of a central value plus an uncertainty. 
Since the e and /x components of the prior are each modeled by a two-parameter density, 
one can fix the shape of this density in each case by matching its mean with the central 
value of the corresponding measurement and its standard deviation with the uncertainty. 
It will then be necessary to check the robustness of the final analysis results to reasonable 
changes in this procedure. For example, one may want to replace the gamma distribution 
by a log-normal or truncated Gaussian one, and the mean by the mode or median. 

The second interpretation of prior (|T7|) follows from the analysis of two independent, aux- 
iliary Poisson measurements, in which the observed number of events is x for the effective 
luminosity and y for the background. The expected numbers of events in these auxiliary 
measurements are ae and bfi, respectively. For a Poisson likelihood with mean ae the refer- 
ence prior coincides with Jeffreys' prior and is proportional to 1/y/e. Given a measurement 
x, the posterior will then be a gamma distribution with shape parameter x + 1/2 and scale 
parameter 1/a. A similar result holds for the background measurement. In this manner the 
prior (fTTl) is obtained as a joint reference posterior from two auxiliary measurements. 

The problem we are interested in is finding a prior for a, about which either little is 
known or one wishes to act as if this is so. 



1 . Application of Method 1 to the Single- Count Model 



This section serves two purposes: to illustrate the analytical algorithm for computing 
reference priors and to apply Method 1 to model (flBl) . 
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In Method 1 [l7[], we find first the conditional reference prior t^r{<j | e, fi) and then multiply 
by the evidence-based prior ir(e, fi) to construct the full prior ir(a, e, //). As will be illustrated 
in Sec. IIV} the single-count model is regular enough to warrant using Jeffreys' rule in the 
first step of the calculation of tt r (<j | e, fi). We therefore apply Eq. ffTB^) to the a dependence 
of the likelihood (jlfip . while holding e and fi constant; this yields: 



7Tj(a\e,ii) x 



Q2 

\np(n\a, e, fi) 



oc - (19) 



da 2 

This prior is clearly improper with respect to a and is therefore only defined up to a pro- 
portionality constant. However, this constant could very well depend on e and fi, since we 
kept these parameters fixed in the calculation. It is important to obtain this dependence 
correctly, as examples have shown that otherwise inconsistent Bayes estimators may result. 
Reference 17| proposes a compact subset normalization procedure. One starts by choosing 
a nested sequence 81 C 62 C • • • of compact subsets of the parameter space = {(er, e, /i)}, 
such that UiOi = 9 and the integral Kt(e, fi) of 7ij(a | e, fi) over Qi = {a : (a, e, //) G 6^} is 
finite. The conditional reference prior for a on Qi is then 

/ I \ 7Tj(<7|e,^) . . 

Kt{e,/i) 

To obtain the conditional reference prior on the whole parameter space, one chooses a fixed 
point (o"o,eo,yUo) within that space and takes the limit of the ratio 

ttr{<j I e, jj,) oc hm — : -. (21) 

£->oo ir R/ {ao I e , Ato) 

By taking the limit in this ratio form, one avoids problems arising from Kf(e, fi) becoming 
infinite as i — > 00. 

The theory of reference priors currently does not provide guidelines for choosing the 
compact sets G^, other than to require that the resulting posterior be proper. In most 
cases this choice makes no difference and one is free to base the choice of compact sets on 
considerations of simplicity and convenience. However, we have found that some care is 
required with the single-count model. Indeed, suppose we make the plausible choice 

©£ = {(<T,e,/i) : a G [0,u e ], e G [0,v e ], fj, G [0,^]|, (22) 

where {ue}, {ve}, and {we} are increasing sequences of positive constants. If we use these 
sets in applying Eqs. fl20l) and fl2TT) to the prior f|T9l . we obtain: 



Tr R (a I e, fi) x J ■ — . (23) 
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Although this prior is still improper with respect to a, its dependence on e is different from 
that of the conditional Jeffreys' prior, Eq. f fT9|) . This demonstrates the potential importance 
of the compact subset normalization. The prior in Eq. (123 p has a serious problem however. 
Suppose that the e marginal of our evidence-based prior for e and /i is exp(— e)j^f¥e. It 
is then easy to verify that the resulting posterior is improper, since its e marginal has the 
non-integrable form exp(— e)/e. The cause of this problem is the choice of compact sets (1221) . 

Fortunately it is not difficult to find a sequence of compact sets that will provide a proper 
posterior. Indeed, the a dependence of the prior (fT9|) suggests that the compact sets should 
be based on the parametrization (ea, e, /i) rather than (a, e, /i) We therefore set: 

©£ = |(a,e,/i) : o G [0,u e /e], e G [l/v e ,v e ], V G [0,1/;*]}, (24) 

where ue, v#, and we are as before. Again using Eqs. ( fl9l) . ( f20i) . and (ED), we now find: 

7imHe,/i) oc 6 , (25) 

which is identical to Jeffreys' prior for this problem and yields well-behaved posteriors. For 
future use, the subscript Rl on the left-hand side indicates that this reference prior was 
obtained with Method 1. 

We now have all the ingredients needed to calculate the marginal reference posterior 
tt.ri(0" I n) for the cross section a: the likelihood (fl6|) . the marginal nuisance prior (fl7|) . and 
the conditional reference prior (|25|) . For calculating posterior summaries in terms of intervals 
and upper limits it is convenient to express the result as a tail probability: 

/ i \ r f 1 u«+y (i - u)*-* ^q+^f ) (v + i n + \) _ 

Ttm(T\n)dT = / — t 7t t 1 7t du 26 

J^ B i n + y + l ^ + \) ^(y+|,n+i) 

where 

= / t M_1 (1 - ty- 1 dt (27) 
is the incomplete beta function, and B(u,v) = Bi(u,v) = T(u)T(v)/T(u + v). 



2. Application of Method 2 to the Single- Count Model 

In contrast with Method 1, Method 2 requires from the start that we specify the evidence- 
based prior for the effective integrated luminosity e and the background contamination /x. 
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Furthermore, this specification must be done conditionally on the signal rate a. As men- 
tioned earlier, we will use expression f JTTj) for this prior. 

The next step in the application of Method 2 is to marginalize the probability model ( II 6p 
with respect to e and \i: 



pin | a) = J J p(n\ a, e, jj) Ti(e, jj\ a) de dfx, 



(e<7 + fi) r 



a[ae) 



7 


n\ 




a 




" b ' 


a + a 




b + 1 



b(bu)y~l 



- b " dedfjL, 



y+h 



S°Ja) 



(28) 



where 



or 



cr 



k=0 



k + x — ^\ fn — k + y — \ 



k 



n — k 



1 


n—k 


a 






a + a 



(29) 



and the binomial coefficients are expressed in terms of gamma functions to accomodate 
noninteger values of their arguments. Finally, the reference prior algorithm must be applied 
to the marginalized model p(n | cr). As in the case of Method 1, the conditions for applying 
Jeffreys' rule are satisfied here; we therefore obtain: 



E 



(a + cr)*+ 5 /2 S°{a) 



(30) 



We will use the notation 7TR 2 (cr) to refer to the marginal reference prior for cr obtained 
with Method 2. Note that the compact subset argument invoked in the construction of the 
Method 1 reference prior is not needed here because all the parameters other than a have 
already been eliminated by marginalization. 

For Method 2 the marginal reference posterior for a is proportional to the product of the 
marginal data probability distribution (T28l and the marginal reference prior (1301) : 



7i R2 (a\n) oc p(n\ a) ir R2 {cr). 
The normalization of tt R2 (& \ n ) must be obtained numerically. 



(31) 
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B. The Multiple- Count Model 



An important generalization of the single-count model is obtained by considering M 
replications of the latter; the likelihood is: 

M ( I \n- 

To obtain the Method 1 reference prior for this model, we first calculate Jeffreys' prior for 
cr, while keeping e and jl fixed: 



7Tj(cr\e,i2) oc 



M n 



v Y— • (33) 



This prior is improper, requiring us to apply the compact subset normalization described in 
Sec. IIII A 11 Using a straightforward generalization of the nested compact sets of Eq. (Ell), 
we find that the correct reference prior is identical to Jeffreys' prior. 

In order to apply Method 2, we need to specify a proper conditional prior for the /ij and 
Cj given cr. Neglecting correlations, we set: 



M 



7T e , a a) = 1 T v / — — . 34 

t\ r^ + 1/2) r(y< + l/2) 1 ; 

The marginalized data probability distribution p(n\a) is then a product of expressions of 
the form (1281) . one for each count i. 

Here we no longer attempt to obtain analytical expressions for the Method 1 and 2 
reference posteriors. Instead, we use the numerical algorithms described below. 



C. Numerical Algorithms 

In this section we describe numerical algorithms that can be used to compute Method 1 
or 2 reference posteriors for the single- and multiple-count Poisson likelihoods discussed in 
the previous sections. 

For Method 1 the algorithm starts by generating (cr, e, p) triplets from the "flat-prior 
posterior", i.e. the posterior obtained by setting ir(a\e, j2) = 1 (line 3 in the pseudo-code 
below); the correct reference prior n(a \ e,jl) is then computed at lines 4-7 and is used at 
line 9 to weight the generated a values so as to produce the reference posterior: 
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1 Set n to the array of observed event numbers. 

2 For i = 1, . . . , I: 

3 Generate (<7j, e^, /4) ~ p{n \ a, e, fl) 7r(e, jl). 

4 For j = 1,..., J: 

5 Generate Hj ~ p(n | dj, e*j, 

6 Calculate d 2 [— In p(ftj | Cj, ei, j2i)]/daf by numerical differentiation. 

7 Average the J values of d 2 [— \np{n \ &i, e*j, fli)]/da 2 obtained 
at line 6, and take the square root. This yields a numerical 
approximation to the conditional Jeffreys' prior 7rj(<7j | ej,/2j). 

8 Histogram the Oi values generated at line 3, weighting them by 
7rj(cjj | €i,fli)/p{n | o-j, ei,pi). This yields ir R i(a), the a-marginal prior. 

9 Histogram the cr, values generated at line 3, weighting them by 
7Tj(o"i | This yields ttri((J \ n G ), the cr-marginal posterior. 



Although not required for the calculation of the reference posterior, an approximation to 
the reference prior is provided at line 8. By construction this approximation is only reliable 
for a values in the bulk of the flat-prior posterior. The generation step at line 3 is done via a 



Markov chain Monte Carlo procedure [19j |. The particular choice of sampling distribution for 
the generated (a, e, fl) triplets is motivated by the desire to obtain weights with reasonably 
small variance at steps 8 and 9. However, the flat-prior posterior p(no | o", e*, P) 7r(e*, /2) is not 
always proper with respect to (cr, e, /7). When M = 1 (single-count model), it is improper 
if x < 1/2. Propriety can then be restored by multiplying the flat-prior posterior by e 
and correspondingly adjusting the weights at steps 8 and 9. Another feature of the above 
algorithm is that it does not implement the compact subset normalization. In the cases that 
we examined, this procedure made no difference, but this may not be true for more general 
problems than those our code seeks to solve. Unfortunately the current lack of guidelines in 
the choice of compact sets limits our ability to address this issue in the code. 

The algorithm for Method 2 has a simpler structure, since all it does is apply Jeffreys' 
rule to a marginalized likelihood p(n | a) provided by the user. The calculation does not 
require random sampling of the parameters and is done at fixed a values. For a given a, the 
reference prior n^ic) is obtained by Monte Carlo averaging, over an ensemble of vectors 
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ft generated from p(n \a), of an accurate numerical approximation of the second derivative 
of the negative log-likelihood |20[. As already pointed out, Method 2 does not require 
a compact subset normalization procedure. The reference posterior is thus proportional 
to the product of p(n \ a) and vr^ 2 (c"), and the normalization with respect to o must be 
determined numerically. 

IV. VALIDATION STUDIES 

We have performed a number of studies to validate inferences from the single-count model, 
using both the numerical algorithms described in Sec. IIII CI and analytical expressions we 
obtained for the marginal Method-1 and 2 posteriors for a. To recapitulate, we have two 
reference priors for this model: 



and we have assumed that 7r(e, /i | a) = 7r(e, /i) at Eq. (1T7|) . As explained in Sec. IIII} this extra 
assumption affects only the definition of ttr2, which therefore incorporates more information 
than tcjii. In the present section we study and compare the properties of these two reference 
priors. To begin, we show some example prior and posterior a marginals in Fig. |2j As 
expected, posteriors corresponding to a small observed number of events favor small cross 
sections, and posteriors derived from flat priors put less weight on small cross sections than 
reference posteriors. 

Our derivations of the two reference prior methods made use of Jeffreys' rule (|T5l) . As 
pointed out in Sec. Ill this approach assumes that some regularity conditions are satisfied, 
such that the resulting posterior is asymptotically normal. We now wish to verify this 
assumption with a graphical example. If one adopts the objective Bayesian view that the 
parameters a, e, and [i have true values, then the asymptotic limit can be defined as the 
result of a large number Nr of replications of the measurement, in the limit where that 
number goes to infinity. For the case where each measurement replication i consists of 
a number of events rti drawn from a probability mass function f(n | cr, e, /i), the reference 




(35) 




(36) 
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FIG. 2: Left: marginal Method-1 and 2 priors, normalized to 1 at cr = 1. Right: marginal Method- 
1 and 2 posteriors for 0, 1, and 4 observed events, together with the posteriors obtained from a 
flat prior. The e and fx priors have a mean of 1 and a 20% coefficient of variation (corresponding 
to x = y = 24.5 and a = b = 25 in Eq. (fT8jl ). Here and in subsequent plots, the units of a are 
arbitrary but consistent with those of e; e.g., if the latter is expressed in pb _1 , then a is given in 
pb so that, like /i, the product of e and a is dimensionless. 

posterior has the form: 

N R 

n R {a,e,iJ,\nx,n2,...,n NR ) oc n R (a, e, //) f(m \ a, e, //), (37) 

i=l 

and the reference prior TT R (a, e, //) is calculated from the combined likelihood for the TVr 
measurements; it can also be calculated from a single one of these likelihood functions, since 
it follows from their constructive definition (j!4| that reference priors are independent of 
sample size. For Method 1 the prior is given by the product of Eqs. ( IT71) and ( 1251) . and the 
likelihood component f(n\a,e, fi) by Eq. (fl~6l) . Replicating the measurement times is 
then equivalent to making a single measurement with a Poisson likelihood whose mean is 
Nr times the original mean, and whose observation is the sum of the original obser- 
vations rii. This property of Poisson measurements simplifies the calculations considerably. 
For Method 2 the prior is given by Eq. ( 13"U|) and the likelihood by Eq. ( 12"S|) . In this case 
no simplification obtains when considering multiple replications, and numerical calculations 
must use explicitly the full product of likelihood functions. Figure [3] illustrates the calcu- 
lations with the help of so-called Q-Q plots, where recentered quantiles from the reference 
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FIG. 3: Q-Q plots of the Method-1 conditional reference posterior (left) and the Method-2 marginal 
reference posterior (right) versus the standard normal distribution. The reference posterior quan- 
tiles have been recentered (see text). Nr is the number of measurement replications and ntot is 
the observed number of events summed over all replications. 

posterior for a are plotted against standard normal quantiles. The posterior quantiles 
are recentered according to Q n = (Q n — (a))/Aa, where the posterior mean (a) and stan- 
dard deviation Aa are numerically estimated. For Method 1 we set the true values of a, e, 
and fi to 1. We then randomly generate a sequence of 100 independent measurements from 
the probability mass function (fl~6l) and use the subsequences with Nr = 1, 10, and 100 to 
produce the curves in the left panel. For Method 2 we set the true value of o to 1 and give 
the priors for e and /i each a mean of 1 and a coefficient of variation of 20%. Measurements 
are then generated from the probability mass function (128]) in order to compute the curves in 
the right panel. Both panels clearly show that the respective reference posteriors approach 
a Gaussian shape as the number of measurement replications increases. 

Given the almost negligible difference between Method-1 and 2 posteriors exhibited in 
Fig. [21 and the fact that our analytical results for Method 1 are computationally more 
tractable than those for Method 2, our considerations in the remainder of this section will 
focus exclusively on Method 1. 

Among the reference prior properties listed in Sec. Ull the ones of generality, invariance, 
and coherence are true by construction. The property of sampling consistency needs more 
elaboration however, since Bayesian inferences do not generally coincide with exact fre- 
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quentist ones, and a proper evaluation requires first of all the specification of an ensemble 
of experiments. A well-known property of Bayesian posterior intervals constructed from a 



proper prior is that their coverage is exact when averaged over the prior [2l|. This is an 
immediate consequence of the law of total probability. Indeed, given a parameter 9 with 
proper prior tc(9), and a measurement X, the prior-averaged frequentist coverage of a 1 — a 
Bayesian credibility interval R{X) can be written as: 

E w [P(fl G R{X) | 9)] = P(0 e R(X)) = E m [¥(9 e R(X) \ X)} = E m (l - a) = 1 - a, (38) 

where the first expectation is over the prior it (9) and the second one over the marginal 
sampling distribution m(x) = J p(x\9) n(9)d9. When tt(9) is a reference prior, and especially 
when it is improper, there is no natural metric over which the coverage can be averaged. 
The only sensible approach in that case is to study the coverage pointwise, i.e. as a function 
of the true value of 9. Since the single- and multiple-count models discussed in this paper 
combine an improper prior for the parameter of interest a with proper priors for the nuisance 
parameters e and ft, we will study interval coverage for a fixed value of a, but averaged 
over Tr(e,/T). Our interest is in how this coverage evolves toward the asymptotic limit. 
As before, we take this limit in the sense of an ever- increasing number Nr of experiment 
replications. For a given value of Nr, the posterior is formed as in Eq. (157)) and its coverage 
is computed. Figure H] shows the coverage of 95% credibility upper limits and 68% credibility 
central intervals as a function of Nr. As the latter increases, the coverage converges to the 
credibility, confirming the sampling consistency of the method. 

Finally, we examine the behavior of reference posterior upper limits on a as a function of 
the expected background j2 (defined in Eq. ( TT8)) ) when the observed number of events n is 
small. For comparison, when n = and there are no uncertainties on signal efficiency and 
background, frequentist upper limits decrease linearly with background. From a Bayesian 
point of view this result is surprising. Indeed, when zero events are observed the likeli- 
hood function factorizes exactly into background and signal components, indicating that 
the experiment actually performed can be analyzed as the combination of two independent 
experiments, one to measure background and the other signal. If, in addition, signal and 
background are a priori independent, then posterior inferences about signal will be inde- 
pendent of background. In particular, upper limits on a will be constant as a function of 
/2, not linearly decreasing. The reference priors entangle signal and background however, so 
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FIG. 4: Coverage probability of Method-1 posterior credibility upper limits (left) and central 
intervals (right), as a function of the number of experiment replications Nr. The solid lines 
indicate the nominal credibility. 
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FIG. 5: Variation of the Method 1 reference posterior upper limit with mean background for several 
values of the observed number of events n. The relative uncertainty on the background and on the 
effective luminosity is 20% for the left plot and 50% for the right one. 

that upper limits will not be exactly constant. The n = case is illustrated in Fig. |5]for 
two values of the relative uncertainties on background and signal efficiency. For n > the 
likelihood function still factorizes approximately since (/i + ea) n ~ yU n (l + nea/fi) ~ fi n for 
fi ^> ea, n. Thus upper limits will flatten out at large fi, as seen in Fig. A comparison 
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of the left and right panels in that figure also shows that upper limits increase with the 
uncertainty on background and signal efficiency, as expected. 



V. MEASUREMENT OF THE SINGLE TOP QUARK CROSS SECTION 

In this section, we demonstrate the computational feasibility of the methods described 



above by applying them to t 
DO and CDF collaborations 



re recent measurement of the single top cross section by the 



22 



23J. Both collaborations use the same form of likelihood 
function — a product of Poisson distributions over multiple bins of a multivariate discrim- 
inant, the same form of evidence-based priors, namely truncated Gaussians, and flat priors 



for the cross section [241 ] . As a realistic example, we construct the reference prior for the 
cross section using one of the data channels considered by DO. 

DO partitioned their data into 24 channels, defined by lepton flavor (electron or muon), 
jet multiplicity (two, three, or four), number of 6-tagged jets (one or two), and two data 



collection periods. The discriminant distribution is shown in Fig. 3 of Ref. [22j. Here we 
consider the electron, two-jet, single-tag channel from one of the data taking periods. The 
discriminant distribution contains about 500 counts spread over 50 bins, with a maximum 
bin count of about 40. 

We model information about the effective integrated luminosity e and the background /i 
for each bin with the help of the gamma priors of Eq. (I17p . These evidence-based priors 
describe the uncertainty due to the finite statistics of the Monte Carlo simulations. We 
do not include systematic uncertainties in this example. Figure |6]^a) shows a comparison 
of the reference prior for the cross section using Methods 1 (the histogram) and 2 (the 
dashed curve). The jaggedness of the Method 1 prior reflects the fluctuations due to the 



Markov chain Monte Carlo [19|| sampling of the parameters. The increased jaggedness at 
large a is due to the fact that the numerical algorithm samples from the flat-prior posterior, 
whose density rapidly decreases in this region. It is noteworthy that the priors computed 
using the two methods are very similar for this particular example. This is also reflected 
in the similarity of the posterior densities, shown in Fig. E(b) . In principle fluctuations in 
the calculated posterior can be made arbitrarily small by increasing the size of the Monte 
Carlo sample. For reference, Fig. El^b) also shows the posterior density using a flat prior 
for the cross section. An obvious conclusion can be drawn: when the dataset is large, here 
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FIG. 6: (a) Prior densities computed using Method 1 (histogram) and Method 2 (dashed curve), 
and (b) corresponding posterior densities, for one of the channels in the DO single top measurement 
and using 2.3 fb _1 of data. For comparison we show the posterior density computed using a flat 
prior (dotted curve). 

of order 500 events, the precise form of the reference prior is not important. However, for 
small datasets — which is typical of searches for new or rare phenomena, one should expect 
the form of the prior to matter. It is then important to use a prior with provably useful 
properties, such as the ones enumerated in Sec. [TT1 

VI. DISCUSSION 

Our main purpose in this paper was to propose a set of formal priors with properties that 
make them attractive for use in the analysis of high energy physics data. Aside from the 
theoretical properties of invariance, coherence and sampling consistency, the reference prior 
method has three important practical advantages: (1) priors can be defined for almost any 
problem, regardless of the complexity of the likelihood function and the number of nuisance 
and interest parameters, (2) in contrast with flat priors, reference priors have so far always 
yielded proper posteriors, and (3) reference priors are computationally tractable, as shown 
by the single-top example. 

Here we have limited our numerical investigations to the class of likelihood functions that 
are derived from Poisson probability mass functions. For this class the Method- 1 reference 
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prior agrees with Jeffreys' rule. For other classes the compact subset normalization argu- 
ment may introduce a difference. A possible generalization of our treatment is to unbinned 
likelihoods. Since our Method-1 and 2 numerical algorithms make no assumptions about the 
likelihood function, they can be generalized to the unbinned case. However, the Method-1 
algorithm does not implement the compact subset normalization and is therefore only ap- 
plicable to cases where this procedure makes no difference. Method 2 requires no compact 
subset normalization but makes an extra assumption about the conditional nuisance prior. 

For problems that involve a single continuous parameter or that can be reduced to this 
case by a Method- 2-type integration, Ref. 12] proposes a numerical algorithm that is based 



directly on Eq. (j!4p and is therefore very general. However we found that this algorithm 
presents some difficulties for the complicated likelihood functions used in high energy physics. 
One difficulty is the round-off error in the product of large numbers of probability densities. 
Another difficulty is the assessment of the convergence of the integrals in the formula, and 
of the convergence of the finite-sample priors to the reference prior. 

Another possible generalization is to problems with more than one parameter of interest, 
as for example in the measurement of the individual single top production cross sections 
in the s and t channels. For this situation the reference prior algorithm requires one to 



sort the parameters of interest by order of importance [12j, and the results may depend on 
this ordering. A possible interpretation of this dependence is that it is a measure of the 
robustness of the result to the choice of prior. This is an area that requires more study. 

We have developed a general-use software package that implements the methods described 
in this paper and have released it to the Physics Statistics Code Repository (phystat.org). 

Finally, we note that the main ideas underlying the construction of reference priors, 
namely generality, reparametrization invariance, coherence, and sampling consistency, have 
motivated the development of methods for summarizing reference posteriors via point esti- 
mates, intervals, and hypothesis tests. This subfield of objective Bayesianism is known as 



reference analysis 
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